ABSTRACT


Linear Discriminant Analysis (LDA) is a well-known method for dimensionality reduction and classification. In the binary-class case, LDA has been shown to be equivalent to linear regression with the class label as the output, which implies that binary-class LDA can be formulated as a least squares problem. However, many real-world applications involve multi-class classification, where a least squares formulation of LDA is also desirable. Previous studies have shown certain relationships between multivariate linear regression and LDA; many of them show that multivariate linear regression with a specific class indicator matrix as the output can be applied as a pre-processing step for LDA. However, directly casting LDA as a least squares problem remains open for the multi-class case.

In this paper we apply the Fisher Linear Discriminant in the original space, find its coefficients, and compare them with the coefficients of the least squares method, to show that the two methods are equivalent in direction. This equivalence occurs when the Rayleigh coefficient is maximized.

As an example we use the Iris dataset, which was introduced by R. A. Fisher as an example for discriminant analysis. The data report four characteristics (sepal width, sepal length, petal width and petal length) of three species of Iris flower, with the class label as output. We took just two species to illustrate the equivalence between LDA and LS.

1. INTRODUCTION


Linear Discriminant Analysis (LDA) is a traditional statistical method which has proven successful on classification and dimensionality reduction problems. The procedure is based on an eigenvalue decomposition and gives an exact solution of the maximum of the inertia, but the method fails for nonlinear problems.


The original LDA formulation, known as Fisher Linear Discriminant Analysis (FLDA), deals with binary-class classification. The key idea in FLDA is to look for a direction that separates the class means well (when projected onto that direction) while achieving a small variance around these means. FLDA bears strong connections to linear regression with the class label as the output for classification (3, 10). It has been shown that FLDA is equivalent to a least squares problem.


Fisher's Linear Discriminant Analysis separates multivariate data from different classes well under a linear projection. In two-class data separation, FDA tries to find the projection vector such that the between-class scatter is maximized and the within-class scatter is minimized. Projecting the data onto this vector then ensures the greatest separability of the two classes.

The intuition behind Fisher's Linear Discriminant (FLD) consists of looking for a vector of components w such that, when a set of training samples is projected onto it, the class centres are far apart while the spread within each class is small, consequently producing a small overlap between classes. This is done by maximizing a cost function known in some contexts as the Rayleigh coefficient, G(w). The data are taken from Edgar Anderson (1935), "The irises of the Gaspé Peninsula", Bulletin of the American Iris Society 59: 2-5.


Theoretical Part 


1. Linear Discriminant 


More formally, one is looking for a function $f: X \to \mathbb{R}^D$ such that $f(x)$ and $f(z)$ are similar whenever $x$ and $z$ are, and different otherwise. Similarity is usually measured by class membership and Euclidean distance. In the special case of linear discriminant analysis one is seeking a linear function, i.e. a set of projections

$$f(x) = W^T x, \qquad W \in \mathbb{R}^{N \times D},$$

where the matrix $W$ is chosen such that a contrast criterion $G$ is optimized, in some cases with respect to a set of constraints $S$, i.e.

$$\max_{W} \; G(W) \quad \text{subject to} \quad W \in S. \tag{1}$$


This setup is completely analogous to, e.g., principal component analysis (PCA), where the contrast criterion would be that of maximal variance (or least mean square error) and the constraint set would be that of orthogonality of the matrix $W$. However, PCA is an unsupervised technique and does not use any labels. There is no guarantee that the directions found by PCA will be particularly discriminative.

To simplify the presentation we will in the following only consider one-dimensional discriminant functions, i.e. $f$ is of the form $f(x) = (w \cdot x)$. However, most results can easily be generalized to the multidimensional case (10, 12).


2. Fisher's Discriminant 


Probably the best-known example of a linear discriminant is Fisher's discriminant. Fisher's idea was to look for a direction $w$ that separates the class means well (when projected onto the found direction) while achieving a small variance around these means. The hope is that it is easy to decide for either of the two classes from this projection with a small error. The quantity measuring the difference between the means is called the between-class variance, and the quantity measuring the variance around these class means is called the within-class variance. The goal is then to find a direction that maximizes the between-class variance while minimizing the within-class variance at the same time. This is illustrated in Figure 1 below.

Figure 1: Fisher Discriminant Analysis

In the left graph, the two-class data are linearly projected onto the direction of the vector $m_1 - m_2$, which does not give a good separation because many points from the two classes overlap. In the right graph, the two classes are well separated, with minimum overlap.
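The effect shown in Figure 1 can be reproduced numerically. The following minimal sketch uses a small made-up 2-D data set (the points and the helper names `project` and `separation` are illustrative assumptions, not part of the original study). It scores a direction by the squared distance between the projected class means divided by the summed spread of the projected classes.

```python
# Toy two-class data in 2-D (hypothetical values): both classes are
# elongated along the diagonal and separated along the first axis.
class1 = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
class2 = [(3.0, 0.0), (4.0, 2.0), (5.0, 1.0), (6.0, 3.0)]


def project(points, w):
    """Return the scalar projections w . x for every point x."""
    return [w[0] * x[0] + w[1] * x[1] for x in points]


def separation(w):
    """Squared distance of projected means over summed projected scatter."""
    p1, p2 = project(class1, w), project(class2, w)
    mu1, mu2 = sum(p1) / len(p1), sum(p2) / len(p2)
    scatter1 = sum((p - mu1) ** 2 for p in p1)
    scatter2 = sum((p - mu2) ** 2 for p in p2)
    return (mu1 - mu2) ** 2 / (scatter1 + scatter2)


# Direction of the difference of the class means, m1 - m2 = (-3, 0).
g_mean_diff = separation((-3.0, 0.0))
# A direction tilted against the common elongation of the two classes.
g_tilted = separation((1.0, -1.0))

print(g_mean_diff, g_tilted)
```

For this toy data the mean-difference direction scores about 0.9, while the tilted direction scores about 2.25, so the direction joining the class means is not the best one for separation.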


To describe this mathematically, let $X$ denote the space of observations (e.g. $X \subseteq \mathbb{R}^N$) and $Y$ the set of possible labels (here $Y = \{+1, -1\}$). Furthermore, let $Z = \{(x_1, y_1), \ldots, (x_M, y_M)\} \subseteq X \times Y$ denote the training sample of size $M$, and denote by $Z_1 = \{(x, y) \in Z \mid y = +1\}$ and $Z_2 = \{(x, y) \in Z \mid y = -1\}$ the split into the two classes of size $M_i = |Z_i|$. Define $m_1$ and $m_2$ to be the empirical class means, i.e.

$$m_i = \frac{1}{M_i} \sum_{x \in Z_i} x .$$

Similarly, we can compute the means of the data projected onto some direction $w$ by

$$\mu_i = \frac{1}{M_i} \sum_{x \in Z_i} w^T x = w^T m_i , \tag{2}$$

i.e. the means $\mu_i$ of the projections are the projected means $m_i$. The variances $\sigma_1^2, \sigma_2^2$ of the projected data can be expressed as

$$\sigma_i^2 = \sum_{x \in Z_i} (w^T x - \mu_i)^2 . \tag{3}$$


Then maximizing the between-class variance and minimizing the within-class variance can be achieved by maximizing

$$G(w) = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} , \tag{4}$$

which will yield a direction $w$ such that the ratio of between-class variance (i.e. separation) and within-class variance (i.e. overlap) is maximal. Now, substituting the expression (2) for the means and the expression (3) for the variances into equation (4) yields

$$G(w) = \frac{w^T S_B w}{w^T S_W w} , \tag{5}$$


where we define the between-class and within-class scatter matrices $S_B$ and $S_W$ as

$$S_B = (m_1 - m_2)(m_1 - m_2)^T , \qquad S_W = \sum_{i=1,2} \sum_{x \in Z_i} (x - m_i)(x - m_i)^T , \tag{6}$$

and the vector of coefficients as

$$w = S_W^{-1} (m_1 - m_2) . \tag{7}$$

It is straightforward to check that (4) is equivalent to (5). This fits into the framework (1) with an empty constraint set $S$. The quantity $G(w)$ is often referred to as a Rayleigh coefficient.
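The two statements above can be verified in a few lines. The following is a sketch that uses only the definitions in (2), (3) and (6), and assumes that $S_W$ is invertible; the symbol $\lambda$ is introduced here for illustration and does not appear elsewhere in the text.

```latex
% Equivalence of (4) and (5)
\begin{align*}
(\mu_1 - \mu_2)^2
  &= \bigl(w^T m_1 - w^T m_2\bigr)^2
   = w^T (m_1 - m_2)(m_1 - m_2)^T w
   = w^T S_B w , \\
\sigma_1^2 + \sigma_2^2
  &= \sum_{i=1,2} \sum_{x \in Z_i} \bigl(w^T x - w^T m_i\bigr)^2
   = w^T \Bigl[ \sum_{i=1,2} \sum_{x \in Z_i} (x - m_i)(x - m_i)^T \Bigr] w
   = w^T S_W w .
\end{align*}

% Stationary point of the Rayleigh coefficient (5)
\begin{align*}
\nabla_w G(w) = 0
  \;&\Longrightarrow\; (w^T S_W w)\, S_B w = (w^T S_B w)\, S_W w \\
  \;&\Longrightarrow\; S_B w = \lambda\, S_W w , \qquad \lambda = G(w) .
\end{align*}

% S_B w = (m_1 - m_2)\,[(m_1 - m_2)^T w] is a scalar multiple of (m_1 - m_2), hence
\[
  w \;\propto\; S_W^{-1} (m_1 - m_2) .
\]
```

Because $G(w)$ does not change when $w$ is rescaled, the proportionality constant can be dropped, which gives the form of the coefficient vector stated above.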


3. Connection to Least Squares


The Fisher discriminant problem described above bears strong connections to least squares approaches for classification. Classically, one is looking for a linear discriminant function, now including a bias term, i.e.

$$f(x) = w^T x + b , \tag{9}$$

such that on the training sample the sum of squares error between the outputs $f(x_i)$ and the known targets $y_i$ is small, i.e. in a (linear) least squares approach one minimizes the sum of squares

$$E(w, b) = \sum_{(x, y) \in Z} (f(x) - y)^2 = \sum_{(x, y) \in Z} (w^T x + b - y)^2 , \tag{10}$$

that is, one solves $\min_{w, b} E(w, b)$.


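The least squares problem can be written in matrix notation as

$$\min_{w, b} \left\| \begin{bmatrix} X_1^T & 1_1 \\ X_2^T & 1_2 \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} - \begin{bmatrix} -1_1 \\ 1_2 \end{bmatrix} \right\|^2 , \tag{11}$$

where $X = [X_1, X_2]$ is a matrix containing all training examples partitioned according to the labels $\pm 1$, and $1_i$ is a vector of ones of corresponding length. The solution to a least squares problem of the form $\|Ax - b\|^2$ can be computed by using the pseudo-inverse of $A$, i.e. $x^* = A^{\dagger} b = (A^T A)^{-1} A^T b$, assuming that $A^T A$ is not singular. Then $A^{\dagger} A = I$, and thus a necessary and sufficient condition for the solution $x^*$ to the least squares problem is $(A^T A) x^* = A^T b$. Applying this to (11) yields

$$\begin{bmatrix} X_1 & X_2 \\ 1_1^T & 1_2^T \end{bmatrix} \begin{bmatrix} X_1^T & 1_1 \\ X_2^T & 1_2 \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \\ 1_1^T & 1_2^T \end{bmatrix} \begin{bmatrix} -1_1 \\ 1_2 \end{bmatrix} .$$

Multiplying these matrices and using the definition of the sample means and within-class scatter for Fisher yields

$$\begin{bmatrix} S_W + M_1 m_1 m_1^T + M_2 m_2 m_2^T & M_1 m_1 + M_2 m_2 \\ (M_1 m_1 + M_2 m_2)^T & M_1 + M_2 \end{bmatrix} \begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} M_2 m_2 - M_1 m_1 \\ M_2 - M_1 \end{bmatrix} . \tag{12}$$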


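Using the second equation in (12) to solve for $b$ yields

$$b = \frac{M_2 - M_1 - (M_1 m_1 + M_2 m_2)^T w}{M_1 + M_2} . \tag{13}$$

Substituting this into the first equation of (12) and using a few algebraic manipulations, especially the relation $a - \frac{a^2}{a + b} = \frac{ab}{a + b}$, one obtains, with $M = M_1 + M_2$ the total sample size,

$$\left( S_W + \frac{M_1 M_2}{M} S_B \right) w + \frac{2 M_1 M_2}{M} (m_1 - m_2) = 0 . \tag{14}$$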
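For readers who want the intermediate steps, the substitution can be spelled out as follows. This is a sketch derived from (12), (13) and the definition of $S_B$ in (6), using $M = M_1 + M_2$ for the total sample size.

```latex
% Insert (13) into the first row of (12)
\begin{align*}
\Bigl( S_W + M_1 m_1 m_1^T + M_2 m_2 m_2^T
   - \tfrac{1}{M} (M_1 m_1 + M_2 m_2)(M_1 m_1 + M_2 m_2)^T \Bigr) w
 &= M_2 m_2 - M_1 m_1 - \tfrac{M_2 - M_1}{M} (M_1 m_1 + M_2 m_2) .
\end{align*}

% Left-hand side: use M_i - M_i^2 / M = M_1 M_2 / M
\begin{align*}
M_1 m_1 m_1^T + M_2 m_2 m_2^T
   - \tfrac{1}{M} (M_1 m_1 + M_2 m_2)(M_1 m_1 + M_2 m_2)^T
 &= \tfrac{M_1 M_2}{M} \bigl( m_1 m_1^T + m_2 m_2^T - m_1 m_2^T - m_2 m_1^T \bigr)
  = \tfrac{M_1 M_2}{M} S_B .
\end{align*}

% Right-hand side: collect the coefficients of m_1 and m_2
\begin{align*}
M_2 m_2 - M_1 m_1 - \tfrac{M_2 - M_1}{M} (M_1 m_1 + M_2 m_2)
 &= \tfrac{2 M_1 M_2}{M} (m_2 - m_1) .
\end{align*}
```

Moving the right-hand side to the left gives a single linear equation in $w$ in which the bias $b$ no longer appears.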


Now, since $S_B w$ is still in the direction of $(m_1 - m_2)$, there exists a scalar $\alpha \in \mathbb{R}$ such that

$$\frac{M_1 M_2}{M} S_B w = -\left( \frac{2 M_1 M_2}{M} + \alpha \right) (m_1 - m_2) . \tag{15}$$

Then using (15) in (14) yields

$$S_W w = \alpha (m_1 - m_2) \quad \Longleftrightarrow \quad w = \alpha S_W^{-1} (m_1 - m_2) . \tag{16}$$

This shows that the solution to the least squares problem is in the same direction as the solution of Fisher's discriminant, although it will have a different length. But as we already noticed, we are only interested in the direction of $w$, not its length, and hence the solutions are identical (7, 10).
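The equivalence in direction can also be checked numerically. The sketch below is an illustration under stated assumptions: the 2-D points are made up, class 1 is coded as -1 and class 2 as +1 as in (11), and the helper `solve` is a plain Gaussian elimination written for this example. It computes the Fisher direction from (7) and the least squares solution from the normal equations, then tests whether the two vectors are parallel.

```python
# Toy 2-D data (hypothetical values) with unequal class sizes.
class1 = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
class2 = [(3.0, 0.0), (4.0, 2.0), (5.0, 1.0), (6.0, 3.0), (5.0, 3.0)]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mean(points):
    """Empirical mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]


def scatter(points, m):
    """Unnormalised 2x2 scatter matrix: sum of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


m1, m2 = mean(class1), mean(class2)
s1, s2 = scatter(class1, m1), scatter(class2, m2)
s_w = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]

# Fisher direction (7): w = S_W^{-1} (m1 - m2).
w_fisher = solve(s_w, [m1[0] - m2[0], m1[1] - m2[1]])

# Least squares: each row is (x, 1, target); solve (A^T A) [w; b] = A^T y.
rows = [(x[0], x[1], 1.0, -1.0) for x in class1] + \
       [(x[0], x[1], 1.0, +1.0) for x in class2]
ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
aty = [sum(r[i] * r[3] for r in rows) for i in range(3)]
w1, w2, b = solve(ata, aty)

# Parallel 2-D vectors have a vanishing cross product.
cross = w1 * w_fisher[1] - w2 * w_fisher[0]
print("Fisher:", w_fisher, "LS:", (w1, w2), "cross:", cross)
```

The cross product is zero up to rounding error, so the two coefficient vectors lie on the same line; they differ only by the scalar factor $\alpha$, whose sign depends on which class is coded as +1.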
Practical Part 


This part illustrates that LDA is equivalent to least squares regression. For each method we compute the vector of coefficients w and the Rayleigh coefficient, which measures the ratio of the separation of the projected class means to the projected intra-class variance. The optimal solution is the one that maximizes the Rayleigh coefficient: when the training samples are projected onto it, the class centres are far apart while the spread within each class is small. The computations were carried out with the SPSS and MATLAB packages; for the data see Appendix (A).


From equation (7), for LDA, the vector of coefficients, i.e. the Standardized Canonical Discriminant Function Coefficients, is

$$w = \begin{bmatrix} -0.583 \\ -0.303 \\ 1.069 \\ 0.547 \end{bmatrix} ,$$

and from equation (5) the value of the Rayleigh coefficient is $G(w) = 88.3829$.
From equation (16), for LS, the vector of coefficients is

$$w = \begin{bmatrix} 0.150 \\ 0.050 \\ -0.765 \\ \ldots \end{bmatrix} ,$$

and from equation (5) the value of the Rayleigh coefficient is $G(w) = 76.1694$.


It is clear from the values of both coefficient vectors and from the Rayleigh coefficients of the linear discriminant and least squares solutions that the between-class scatter is maximized and the within-class scatter is minimized. This means that the two methods are equivalent: they have the same direction, although the vectors have different lengths. But as we already noticed, we are only interested in the direction of w, not its length. To confirm this, Table 1 below gives the classification results from the SPSS analysis, which show that 100.0% of the original grouped cases are correctly classified in this data. In general, for the equivalence to be strong, the ratio of misclassification must be small.


Table 1: Classification Results 


[Cross-tabulation of the original group (SP) against the Predicted Group, with a Total column; the cell counts are not legible.]


a. 100.0% of original grouped cases correctly classified. 
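A classification table of this kind can be produced by projecting each sample onto w and comparing the projection with a threshold. The sketch below is only an illustration under assumptions: the 2-D points are made up, w is taken proportional to $S_W^{-1}(m_1 - m_2)$ for that toy data, and the threshold is placed at the midpoint of the projected class means, which is not necessarily the rule SPSS applies.

```python
# Toy two-class data in 2-D (hypothetical values).
class1 = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
class2 = [(3.0, 0.0), (4.0, 2.0), (5.0, 1.0), (6.0, 3.0)]

# Direction proportional to S_W^{-1}(m1 - m2) for this toy data.
w = (-5.0, 4.0)


def score(x):
    """Projection of a point onto the discriminant direction."""
    return w[0] * x[0] + w[1] * x[1]


mu1 = sum(score(x) for x in class1) / len(class1)
mu2 = sum(score(x) for x in class2) / len(class2)
threshold = (mu1 + mu2) / 2.0  # assumed decision rule: midpoint of the means


def predict(x):
    """Return 1 if x falls on the side of the threshold where mu1 lies, else 2."""
    same_side_as_class1 = (score(x) - threshold) * (mu1 - threshold) > 0
    return 1 if same_side_as_class1 else 2


# confusion[(actual, predicted)] counts the samples in each cell.
confusion = {(1, 1): 0, (1, 2): 0, (2, 1): 0, (2, 2): 0}
for label, points in ((1, class1), (2, class2)):
    for x in points:
        confusion[(label, predict(x))] += 1

total = len(class1) + len(class2)
accuracy = (confusion[(1, 1)] + confusion[(2, 2)]) / total
print(confusion, accuracy)
```

On this toy data every sample falls on the correct side of the threshold, so the off-diagonal cells of the table are zero.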


CONCLUSION 


The Iris flower dataset was introduced by R. A. Fisher as an example for discriminant analysis. The data report four characteristics (sepal width, sepal length, petal width and petal length) of three species of Iris flower, with the class label as output. We took just two species to illustrate the equivalence between LDA and LS. The analysis makes clear that the Fisher Linear Discriminant and the least squares method are equivalent, in the sense that the between-class scatter is maximized and the within-class scatter is minimized: the two solutions have the same direction, although they have different lengths.

Appendix (A)


The Iris flower dataset was introduced by R. A. Fisher as an example for discriminant analysis. The data report four characteristics (sepal width, sepal length, petal width and petal length) of three species of Iris flower, with the class label as output. We took just two species to illustrate the equivalence between LDA and LS.


Table 2 
Fisher's Iris Data 
Sepal Length Sepal Width Petal Length Petal Width Species
5.1 3.5 1.4 0.2 setosa 
4.9 3.0 1.4 0.2 setosa 
4.7 3.2 1.3 0.2 setosa 
4.6 3.1 1.5 0.2 setosa 
5.0 3.6 1.4 0.2 setosa 
5.4 3.9 1.7 0.4 setosa 
4.6 3.4 1.4 0.3 setosa 
5.0 3.4 1.5 0.2 setosa 
4.4 2.9 1.4 0.2 setosa 
4.9 3.1 1.5 0.1 setosa 
5.4 3.7 1.5 0.2 setosa 
4.8 3.4 1.6 0.2 setosa 
4.8 3.0 1.4 0.1 setosa 
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa 
5.4 3.4 1.7 0.2 setosa 
5.1 3.7 1.5 0.4 setosa 
4.6 3.6 1.0 0.2 setosa 
5.1 3.3 1.7 0.5 setosa 
4.8 3.4 1.9 0.2 setosa 
5.0 3.0 1.6 0.2 setosa 
5.0 3.4 1.6 0.4 setosa 
5.2 3.5 1.5 0.2 setosa 
5.2 3.4 1.4 0.2 setosa 
4.7 3.2 1.6 0.2 setosa 
4.8 3.1 1.6 0.2 setosa 
5.4 3.4 1.5 0.4 setosa 
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa 
4.4 3.0 1.3 0.2 setosa 
5.1 3.4 1.5 0.2 setosa 
5.0 3.5 1.3 0.3 setosa 
4.5 2.3 1.3 0.3 setosa 
4.4 3.2 1.3 0.2 setosa 
5.0 3.5 1.6 0.6 setosa 
5.1 3.8 1.9 0.4 setosa 
4.8 3.0 1.4 0.3 setosa 
5.1 3.8 1.6 0.2 setosa 
4.6 3.2 1.4 0.2 setosa 
5.3 3.7 1.5 0.2 setosa 
5.0 3.3 1.4 0.2 setosa 
7.0 3.2 4.7 1.4 versicolor 
6.4 3.2 4.5 1.5 versicolor 
6.9 3.1 4.9 1.5 versicolor 
5.5 2.3 4.0 1.3 versicolor 
6.5 2.8 4.6 1.5 versicolor 
5.7 2.8 4.5 1.3 versicolor 
6.3 3.3 4.7 1.6 versicolor 
4.9 2.4 3.3 1.0 versicolor 
6.6 2.9 4.6 1.3 versicolor 
5.2 2.7 3.9 1.4 versicolor 
5.0 2.0 3.5 1.0 versicolor 
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor 
6.1 2.9 4.7 1.4 versicolor 
5.6 2.9 3.6 1.3 versicolor 
6.7 3.1 4.4 1.4 versicolor 
5.6 3.0 4.5 1.5 versicolor 
5.8 2.7 4.1 1.0 versicolor 
6.2 2.2 4.5 1.5 versicolor 
5.6 2.5 3.9 1.1 versicolor 
5.9 3.2 4.8 1.8 versicolor 
6.1 2.8 4.0 1.3 versicolor 
6.3 2.5 4.9 1.5 versicolor 
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor 
6.8 2.8 4.8 1.4 versicolor 
6.7 3.0 5.0 1.7 versicolor 
6.0 2.9 4.5 1.5 versicolor 
5.7 2.6 3.5 1.0 versicolor 
5.5 2.4 3.8 1.1 versicolor 
5.5 2.4 3.7 1.0 versicolor 
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor 
5.4 3.0 4.5 1.5 versicolor 
6.0 3.4 4.5 1.6 versicolor 
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor 
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor 
5.5 2.6 4.4 1.2 versicolor 
6.1 3.0 4.6 1.4 versicolor 
5.8 2.6 4.0 1.2 versicolor 
5.0 2.3 3.3 1.0 versicolor 
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor